Patent abstract:
METHOD AND APPARATUS FOR WORD QUALITY EXTRACTION AND ASSESSMENT. The present invention relates to a method and an apparatus for word quality extraction and assessment. The method includes: calculating a Document Frequency (DF) of a word in a mass of categorized data; evaluating the word in multiple single aspects according to the DF of the word; and evaluating the word in a multi-aspect manner according to the multiple single-aspect evaluations to obtain an importance weight of the word. According to the solution of the present invention, the importance of a word in a mass of categorized data can be assessed, and high-quality words can be obtained through an integrated assessment.
Publication number: BR112012011091B1
Application number: R112012011091-8
Filing date: 2010-06-28
Publication date: 2020-10-13
Inventors: Huaijun Liu; Zhongbo Jiang; Gaolin Fang
Applicant: Tencent Technology (Shenzhen) Company Limited
IPC main class:
Patent description:

FIELD OF THE INVENTION
[0001] The present invention relates to information processing techniques on the Internet, and more particularly, to a method and an apparatus for extracting and evaluating word quality.

BACKGROUND OF THE INVENTION
[0002] With the rapid development of the Internet, the problem of "information overload" becomes increasingly serious. While people enjoy the convenience brought by the Internet, they are also inundated with the mass of information on it. There is an urgent need to solve the problem of how to extract effective information from the mass of Internet data more accurately and effectively.
[0003] Currently, there are several types of Internet platforms. They provide large amounts of data for users. Among them, there are familiar search engines, for example, Google, Baidu, Soso; there are also interactive Q&A platforms, for example, Zhidao, Wenwen, Answers; and also popular blog platforms, for example, Qzone, Sina blog, etc.
[0004] All of these Internet platforms require natural language text processing techniques, that is, techniques for extracting effective information from the data mass for processing. Natural language text processing analyzes documents, for example, for categorization, clustering, summarization and similarity analysis. Since each document is made up of words, each detailed technique in natural language text processing requires an understanding of words. Therefore, how to accurately assess the importance of a word in a sentence becomes an important research problem.
[0005] For example, in the sentence "China has a long history, the great wall and terracotta army are the pride of China", the words "China", "great wall", "terracotta army" and "history" are obviously more important than the others.
[0006] The extraction and assessment of word quality is to determine an appropriate quality level for a candidate word. For example, there can be three levels: important, common and frequently used. Important words are then selected first, followed by common words and frequently used words. Therefore, when a document is analyzed, important words can be considered first, common words can be taken as supplementation, and frequently used words can be filtered out completely.
[0007] Currently, a method of extracting and evaluating word quality based on mass data is usually implemented by calculating a Document Frequency (DF) and an Inverse Document Frequency (IDF) of a word. That is, a word that does not appear frequently, that is, a low-frequency word, is considered an important word. However, the importance of a word cannot be determined precisely on the basis of the calculated DF or IDF alone. For example, a result calculated based on a corpus is as follows: the IDF of the word "illuminate" is 2.89, while the IDF of the word "ha ha" is 4.76. In addition, in unstructured data, for example, Q&A platform data and blog data, a low-frequency word can be an erroneous word, for example, a wrong entry "asfsdfsfda" typed by a user, or "Gao Qi too" (wrongly segmented from the sentence "Gao Qi also has hope for the new dynasty").
[0008] Additionally, during document categorization, feature selection methods such as Information Gain (IG) and chi-square (χ²) are usually used to assess the contribution of a word to a category. However, only features whose values rank in the top n are selected as effective features, where n is an integer that can be chosen according to the requirements of word quality extraction and evaluation. A category weight is then calculated based on TF-IDF, where TF represents Term Frequency. The IG-based and χ²-based methods are used only to select feature words. They work well on structured data in small quantities. But on a mass of unstructured data, a single-aspect assessment cannot fully reflect the importance of a word and cannot effectively calculate it. For example, based on the same corpus, the χ² of the word "from" is 96292.63382, while the χ² of "Jingzhou" is only 4445.62836. However, this does not mean that the word "Jingzhou", whose χ² is smaller, is less important.

SUMMARY OF THE INVENTION
[0009] Embodiments of the present invention provide a method and an apparatus for extracting and evaluating word quality, so as to determine the importance of a word accurately.
[00010] In accordance with an embodiment of the present invention, a method for extracting and evaluating word quality is provided. The method includes: calculating a Document Frequency (DF) of a word in a mass of categorized data; evaluating the word in multiple single aspects according to the DF of the word; and evaluating the word in a multi-aspect manner according to the evaluations in the multiple single aspects to obtain an importance weight of the word.
[00011] In accordance with another embodiment of the present invention, an apparatus for extracting and evaluating word quality is provided. The apparatus includes: a DF calculating unit, adapted to calculate the DF of a word in a mass of categorized data; a single-aspect evaluating unit, adapted to evaluate the word in multiple single aspects according to the DF of the word; and a multi-aspect evaluating unit, adapted to evaluate the word in a multi-aspect manner according to the multiple single-aspect evaluations to obtain an importance weight of the word.
[00012] Embodiments of the present invention provide an improved solution based on probability theory and entropy theory. The input is a mass of categorized data. The output is high-quality words. According to the solution of the present invention, the importance of a word in a mass of categorized data can be assessed, and high-quality words can be obtained through an integrated assessment.
[00013] The solution of the present invention can be applied to various word quality extraction and assessment scenarios. For example, when applied to search engine data, the solution of the present invention can extract high-quality words with precision. The high-quality words can be used to rank search relevance and analyze the user's search snippet. As another example, when applied to an interactive platform, blog or news platform, the solution of the present invention can accurately extract tag words from the text. Therefore, accurate, high-quality tag words can be obtained to analyze user actions, which facilitates personalization and recommendation for the user. Additionally, when applied to document categorization, clustering and summarization, the solution can perform precise extraction of feature words to extract information from text. The solution can also be applied in spam filtering and advertisement classification, to efficiently extract keywords related to the category.

BRIEF DESCRIPTION OF THE FIGURES
[00014] Figure 1 is a flow chart illustrating a method for implementing word quality extraction and assessment according to an embodiment of the present invention.
[00015] Figure 2 is a schematic diagram illustrating a comparison of a linear normalization curve and a logarithmic normalization curve according to an embodiment of the present invention.
[00016] Figure 3 is a schematic diagram illustrating a structure of an apparatus for implementing word quality extraction and assessment according to an embodiment of the present invention.
[00017] Figure 4A is a schematic diagram illustrating a first structure of a unit for determining quality according to an embodiment of the present invention.
[00018] Figure 4B is a schematic diagram illustrating a second structure of a unit for determining quality according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION
[00019] In embodiments of the present invention, the DF of a word in a mass of categorized data is calculated, multiple single-aspect evaluations of the word are performed according to the DF, and a multi-aspect evaluation of the word is performed according to the multiple single-aspect evaluations to obtain an importance weight of the word.
[00020] Figure 1 is a flow chart that illustrates a method for implementing word quality extraction and assessment according to one embodiment of the present invention. As shown in figure 1, the method includes the following steps.
[00021] In step 101, the DF of a word in the mass of categorized data is calculated.
[00022] In the embodiment of the present invention, the input is the mass of categorized data. The mass of categorized data refers to mass document data that has been classified into different categories. For example, it can be news data classified into technology, sports and entertainment. As another example, it can also be interactive Q&A platform data classified into computer, education and games.
[00023] Calculating the DF of a word is the first step of extraction and quality assessment. The purpose of the calculation is to obtain the statistics required in subsequent calculations. Calculating the DF of a word in the mass of categorized data mainly includes: calculating a vector of the word's DFs in each category of the mass of categorized data, and calculating the word's DF over all categories.
[00024] Before the DF of the word is calculated, words are obtained by segmenting the mass of categorized data, and some preprocessing can be performed on the words, for example, standardizing traditional and simplified characters, standardizing uppercase and lowercase characters, and unifying full-width and half-width characters, so that the words used for extraction and quality assessment have a uniform format.
[00025] The DF vector of the word w in each category of the mass of categorized data is calculated to obtain a vector Fw = {df1, df2, ..., dfn}, where dfi represents the DF of the word w in category i, i = 1, 2, ..., n, and n represents the number of categories. For example, given two categories, computer and sports, the DFs of the word "computer" in the two categories are respectively 1191437 and 48281. Therefore, the DF vector of the word "computer" is expressed as {1191437, 48281}.
[00026] The DF of the word w over all categories is then calculated. Specifically, the DF of the word w is the sum of the entries of its DF vector over all categories, that is, DF(w) = df1 + df2 + ... + dfn, where n is the number of categories.
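The DF statistics of the two paragraphs above can be sketched as follows. This is an illustrative sketch, not part of the patent text; the input representation (a mapping from category names to documents, each document being a set of already-normalized words) is an assumption.

```python
from collections import defaultdict

def df_vectors(categorized_docs):
    """Compute per-category document frequencies for every word.

    categorized_docs: dict mapping category name -> list of documents,
    where each document is a set of pre-normalized words.
    Returns: dict word -> {category: df}.
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for category, docs in categorized_docs.items():
        for doc in docs:
            for word in set(doc):
                vectors[word][category] += 1
    return vectors

def total_df(vector):
    """DF of a word over all categories: the sum of its DF vector."""
    return sum(vector.values())

corpus = {
    "computer": [{"computer", "cpu"}, {"computer", "game"}],
    "sports":   [{"football", "computer"}],
}
vecs = df_vectors(corpus)
# "computer" appears in 2 computer documents and 1 sports document
```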
[00027] In step 102, the word is evaluated in multiple single aspects based on the word's DF.
[00028] After the DF of the word is calculated, multiple single-aspect evaluations of the word are performed based on probability theory and entropy theory. In particular, the following aspects can be considered.

(1) Inverse Document Frequency (IDF)
[00029] IDF assesses the quality of the word over all the categorized data based on the DF of the word. Specifically, it is expressed as

IDF(w) = log(N / DF(w)),

where N represents the total number of documents and DF(w) represents the word's DF over all the categorized data, that is, DF(w) = df1 + df2 + ... + dfn.

(2) Average Inverse Document Frequency (AVAIDF)
[00030] AVAIDF is the average of the word's IDF over the categories, expressed as

AVAIDF(w) = (1/n) · (IDF1(w) + IDF2(w) + ... + IDFn(w)),

where IDFi(w) = log(Ni / dfi), Ni represents the number of documents in category i, and n represents the number of categories.
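A minimal sketch of the IDF and AVAIDF computations follows (illustrative, not part of the patent text; skipping categories where the word never appears when averaging is a simplifying assumption):

```python
import math

def idf(df, total_docs):
    # IDF(w) = log(N / DF(w)); rarer words receive higher values
    return math.log(total_docs / df)

def avaidf(df_vector, docs_per_category):
    # AVAIDF(w): average of the per-category IDFs, IDF_i(w) = log(N_i / df_i)
    idfs = [math.log(n_i / df_i)
            for df_i, n_i in zip(df_vector, docs_per_category) if df_i > 0]
    return sum(idfs) / len(idfs)
```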
[00031] The problem with the IDF and AVAIDF methods is as follows: with respect to high-frequency words, the evaluation values, that is, both IDF(w) and AVAIDF(w), are low; with respect to low-frequency words, both evaluation values are high. Therefore, if word quality extraction is performed only on the basis of IDF and AVAIDF, the evaluated result will be less accurate.

(3) Chi-square (χ²)
[00032] Chi-square (χ²) is used to assess the relevance between a word and a category, expressed as

χ²(w) = λ · Σi (Ai − Ti)² / max(Ti, θ),

where Ai represents the actual DF value of the word w in category i, Ti represents the theoretical DF value of the word w in that category, θ represents a lower limit on the theoretical DF value, and λ represents a correction factor. In step 101 above, dfi is Ai. Therefore, the chi-square formula is expressed as

χ²(w) = λ · Σi (dfi − Ti)² / max(Ti, θ), i = 1, 2, ..., n,

where n represents the number of categories.
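The chi-square score can be sketched as follows. This is an illustrative sketch, not the patent's exact formula: the placement of the lower limit θ (as a floor on the denominator) and of the correction factor λ (as a multiplier) is an assumption, and the theoretical DF is assumed proportional to category size.

```python
def theoretical_dfs(total_df, docs_per_category):
    # T_i: the DF expected in category i if the word were spread
    # proportionally to the category sizes
    n = sum(docs_per_category)
    return [total_df * n_i / n for n_i in docs_per_category]

def chi_square(df_vector, theoretical, theta=0.5, lam=1.0):
    # chi^2(w) = lam * sum_i (df_i - T_i)^2 / max(T_i, theta)
    total = 0.0
    for a, t in zip(df_vector, theoretical):
        total += (a - t) ** 2 / max(t, theta)
    return lam * total
```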
[00033] The chi-square method has the following problem: the chi-square of a high-frequency word and that of a low-frequency word are not comparable, since the numerator and denominator of each item, that is, (dfi − Ti)² and Ti, have different magnitudes. Therefore, the chi-square of a high-frequency word is usually high and that of a low-frequency word is usually low, so the importance of words cannot be determined by comparing chi-squares. Additionally, for a word with an extremely low frequency, the result of the chi-square method is less accurate.

(4) Information Gain (IG)
[00034] IG is used to evaluate the amount of information a word provides about the category.
[00035] A general formula for IG includes two parts: the entropy of the entire category distribution, and the expected value of the entropy of the category distribution under each attribute of a feature F, expressed as

IG(F) = Entropy(C) − Σv P(F = v) · Entropy(C | F = v).

When the importance of a word is assessed, the attributes of the feature F are usually {appears in the category, does not appear in the category}. Therefore, IG expresses the difference between the entropy of the whole category distribution and that entropy after the word is considered.
[00036] When the IG method is adopted, the detailed expression is as follows:

IG(w) = −Σi P(ci) log P(ci) + P(w) Σi P(ci|w) log P(ci|w) + P(¬w) Σi P(ci|¬w) log P(ci|¬w),

where ci represents category i, i = 1, 2, ..., n, and n represents the number of categories.
[00037] The formula includes three parts: the first part, −Σi P(ci) log P(ci), is the entropy of the entire category distribution, which corresponds to Entropy(C); the second part, P(w) Σi P(ci|w) log P(ci|w), is the product of the (negated) entropy given the word w and the probability that the word appears; the third part, P(¬w) Σi P(ci|¬w) log P(ci|¬w), is the product of the (negated) entropy without the word w and the probability that the word does not appear. The second part and the third part together constitute −Σv P(F = v) · Entropy(C | F = v).
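The three-part IG expression above can be sketched directly (an illustrative sketch, not part of the patent text; the natural logarithm is assumed):

```python
import math

def neg_sum_plogp(probs):
    # -sum p*log(p) over nonzero entries: the entropy of a distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_gain(p_c, p_c_given_w, p_c_given_not_w, p_w):
    """IG(w) = Entropy(C)
               - [ P(w)*Entropy(C|w) + P(not w)*Entropy(C|not w) ]."""
    return (neg_sum_plogp(p_c)
            - p_w * neg_sum_plogp(p_c_given_w)
            - (1 - p_w) * neg_sum_plogp(p_c_given_not_w))
```

If the word carries no information about the category (the conditional distributions equal the prior), IG is 0; if it determines the category completely, IG equals the category entropy.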
[00038] The IG method has the following problem: for a word with a very high frequency or a very low frequency, the distribution of the two attributes {appears in the category, does not appear in the category} is severely unbalanced, and the IG values of both words are close to 0. It is impossible to differentiate the two words simply according to their IG values. Therefore, with respect to the above problem, an embodiment of the present invention provides an improved solution based on the principle that the attributes must be distributed in a balanced way so that the importance of the word is reasonably reflected.
[00039] First, all candidate words are classified into different ranges according to their DFs, where methods such as logarithmic gradient, linear gradient, exponential gradient, combined logarithmic and linear gradient, or combined exponential and linear gradient can be adopted to classify the candidate words.
[00040] Hereinafter, the logarithmic gradient is taken as an example to describe the classification of words.
[00041] Suppose the DF of the word w1 in category c1 is df1. Calculate ⌊log(df1)⌋ to obtain an interval index and map the word w1 to the corresponding interval, that is,

interval(w1) = ⌊⌊log(df1)⌋ / step⌋,

where step represents the gradient, which is usually an integer and can be configured according to the precision requirement of IG, and ⌊x⌋ represents rounding x down, that is, the greatest integer not greater than x. Therefore, the DFs of the words mapped to a given interval all lie within a certain range.
[00042] After the words are classified based on their DFs, the IG(w) of each word is calculated based on its interval, that is, when IG(w) is calculated, the calculation is not based on all the categorized data, but on the categorized data corresponding to the interval.
[00043] Finally, the importance of the word is obtained based on the interval and the IG of the word mapped within the interval. The IG values of the words can be unified into a uniform interval, for example, [low, high], according to the importance of the words. Therefore, the importance of a word can be obtained according to the position of its IG in the interval.
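The DF-based interval classification can be sketched as follows (an illustrative sketch; the patent's exact interval formula is not reproduced in this text, so a simple logarithmic-gradient bucketing is assumed):

```python
import math

def interval_index(df, step=1):
    # bucket a word by floor(log(df)) // step: words in the same bucket
    # have DFs of comparable magnitude, so the attributes
    # {appears, does not appear} are more balanced when IG is computed
    # per bucket rather than over all the categorized data
    return int(math.floor(math.log(df))) // step
```

Words with DFs 3, 20 and 55000 then fall into clearly separated logarithmic intervals.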
[00044] From the above, it can be seen that, by classifying words into intervals based on their DFs, the distribution of the attributes {appears in the category, does not appear in the category} of a word becomes relatively balanced, so the importance of the word can be determined more accurately.

(5) Mutual Information (MI)
[00045] MI is also used to assess the relevance between the word and the category, expressed as

MI(w) = Σi log(Ai / Ti),

where Ai represents the actual DF value of the word w in category i, that is, dfi; Ti represents the theoretical DF value of the word w in that category.

(6) Expected Cross Entropy (ECE)
[00046] ECE is used to reflect the distance between the category distribution probabilities before and after the word w appears, expressed as

ECE(w) = P(w) · Σi P(ci|w) · log(P(ci|w) / P(ci)),

where ci represents category i, i = 1, 2, ..., n, and n represents the number of categories.

(7) Entropy (ENT)
[00047] ENT is used to reflect the uniformity of the distribution of the word w over all categories. The smaller the ENT, the less uniformly the word w is distributed over the categories; such a word is more likely to belong to a specific field and is therefore more important. The specific expression is

ENT(w) = −Σi P(ci|w) · log P(ci|w),

where i = 1, 2, ..., n and n represents the number of categories.
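The three probability-based measures can be sketched together (illustrative, not part of the patent text; the MI form as a sum of per-category log ratios is a reconstruction):

```python
import math

def mutual_information(df_vector, theoretical):
    # MI(w) = sum_i log(A_i / T_i), over categories where the word appears
    return sum(math.log(a / t)
               for a, t in zip(df_vector, theoretical) if a > 0 and t > 0)

def expected_cross_entropy(p_w, p_c, p_c_given_w):
    # ECE(w) = P(w) * sum_i P(c_i|w) * log(P(c_i|w) / P(c_i))
    return p_w * sum(pcw * math.log(pcw / pc)
                     for pcw, pc in zip(p_c_given_w, p_c) if pcw > 0)

def entropy_over_categories(p_c_given_w):
    # ENT(w) = -sum_i P(c_i|w) * log P(c_i|w); a small ENT means the word
    # is concentrated in few categories, hence more likely field-specific
    return -sum(p * math.log(p) for p in p_c_given_w if p > 0)
```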
[00048] The MI, ECE and ENT methods all have the following problem: they only consider the difference in the word's distribution across categories, but do not consider the probability that the word appears. In fact, if the word's DF is low, the word has a low probability of appearing, and the reliability of the word's distribution over the categories is accordingly low.

(8) Selectional Preference (SELPRE)
[00049] SELPRE is used to assess the degree of concentration of the meaning of a word, that is, the word's ability to be used together with other words.
[00050] Usually, an important word with concentrated meaning can be used with only a few specific words, while a generalized word can be used with many words. Therefore, a collocation distribution between pairs of words is calculated first. In the embodiment of the present invention, it is configured that nouns can be used with verbs and adjectives, adjectives can be used with nouns, and verbs can be used with nouns. The SELPRE of a word is expressed as

SELPRE(w) = Σm P(m|w) · log(P(m|w) / P(m)),

where P(m|w) represents the conditional probability that the word m is used with the word w, and P(m) represents the probability of the word m.
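The selectional-preference score can be sketched as a KL divergence between a word's collocate distribution and the background collocate distribution (a reconstruction consistent with the description above, not the patent's exact formula):

```python
import math

def selpre(p_m_given_w, p_m):
    # SELPRE(w) = sum_m P(m|w) * log(P(m|w) / P(m)): a word used with
    # only a few specific collocates scores higher than a generalized
    # word whose collocates follow the background distribution
    return sum(p * math.log(p / p_m[m])
               for m, p in p_m_given_w.items() if p > 0)
```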
[00051] The problem with the SELPRE method is as follows: it does not consider the differences between categories. Therefore, it is impossible to determine from SELPRE alone whether a word is a specialized word in a certain field.
[00052] In the evaluation methods above, except for the ENT method, the higher the evaluation value, the more important the word. From the above it can be seen that, no matter which single method is adopted, it is impossible to obtain an accurate result. Therefore, it is necessary to combine the single-aspect evaluation values effectively, so that an importance weight that accurately reflects the quality of the word can be obtained through an integrated assessment.
[00053] In step 103, a multi-aspect evaluation of the word is performed based on the multiple single-aspect evaluations to obtain an importance weight of the word.
[00054] Specifically, candidate words are classified into different levels according to their DFs. The multi-aspect evaluation method of each candidate word is determined according to the level of the candidate word, to obtain an importance weight of the candidate word. The processing is described in further detail hereinafter.
[00055] First, the candidate words are classified into four levels according to their DFs over all the categorized data. The four levels are respectively: super-high frequency words, medium-high frequency words, medium-low frequency words and super-low frequency words. A super-high frequency word is a word with a very high DF that appears in most documents. A super-low frequency word is a word with a very low DF that appears in only very few documents. A medium-high frequency word is a word whose DF lies between those of the super-high and super-low frequency words; although its DF is lower than that of a super-high frequency word, it is relatively high, and the word appears in many documents. A medium-low frequency word is likewise a word whose DF lies between those of the super-high and super-low frequency words; although its DF is relatively low, it is still higher than that of a super-low frequency word, and the word appears in some documents. The four levels can be identified as: SuperHigh, MediumHigh, MediumLow and SuperLow. The embodiment of the present invention is not restricted to the four levels above. When the levels are determined according to the DF, different methods such as logarithmic gradient, linear gradient, exponential gradient, combined logarithmic and linear gradient, and combined exponential and linear gradient can be adopted, and different levels can have different scopes.
[00056] The word is then classified into the corresponding level according to its DF over all the categorized data.
[00057] Next, the multi-aspect evaluation methods are built from the single-aspect evaluations obtained in step 102.
[00058] The IDF and AVAIDF methods are both based on the DF. Therefore, neither contributes much to differentiating the importance of words at the same DF-classified level. But the absolute value of the difference between IDF and AVAIDF, that is, |IDF(w) − AVAIDF(w)|, can reflect the difference in the word's distribution across the different categories, and thus reflect whether the word is important. Therefore, the formula Diff(w) = |AVAIDF(w) − IDF(w)| is obtained. This integrated evaluation method effectively overcomes the defect that a single evaluation method cannot accurately determine the importance of a word at the SuperHigh level or at the SuperLow level. For example, for the word "illuminate", Diff(illuminate) = |5.54 − 2.89| = 2.65, while for the word "ha ha", Diff(ha ha) = |5.16 − 4.76| = 0.4. This is because the word "illuminate" appears a lot in some categories but rarely in others, whereas the word "ha ha" appears a lot in every category. An important word can thus be determined accurately by Diff(w): the higher the Diff(w) value, the more important the word.
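Diff(w) itself is a one-line computation; using the IDF and AVAIDF values quoted in the example above (an illustrative sketch, not part of the patent text):

```python
def diff(avaidf_w, idf_w):
    # Diff(w) = |AVAIDF(w) - IDF(w)|: large when the word's distribution
    # differs strongly between categories
    return abs(avaidf_w - idf_w)

# the two example words from the text
diff_illuminate = diff(5.54, 2.89)   # 2.65: uneven across categories
diff_haha = diff(5.16, 4.76)         # 0.40: uniform across categories
```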
[00059] The MI, ECE and ENT methods are all based on the probability of the word's distribution over the categories. Therefore, these three methods can be used together to assess the importance of the word. Specifically, MI(w), ECE(w) and ENT(w) are linearly normalized; since ENT has an inverse relationship with the importance of the word, descending normalization is required for it. NormLinear(MI(w)), NormLinear(ECE(w)) and NormLinearDesc(ENT(w)) are thus obtained, and a linear combination of the three is taken as a basis of evaluation, expressed as

ProbBased(w) = NormLinear(MI(w)) + NormLinear(ECE(w)) + NormLinearDesc(ENT(w)).
[00060] The IG and chi-square methods are related both to the DF and to the probability of the word's distribution over the categories. Therefore, these two methods can be combined to determine the importance of a word. Specifically, χ²(w) and IG(w) are logarithmically normalized to obtain NormLog(χ²(w)) and NormLog(IG(w)), and they are then combined to obtain

ProbDFRel(w) = NormLog(χ²(w)) + NormLog(IG(w)).
[00061] The SELPRE method is based on the collocation relationships of words and is used as an independent evaluation method. After linear normalization it is expressed as SelPre(w) = NormLinear(SELPRE(w)).
[00062] Some of the above methods are based on the DF, while others are based on the probability of the word's distribution. Therefore, the evaluation values have different ranges, and must consequently be normalized into a common range. In an embodiment of the present invention, the linear normalization method and the logarithmic normalization method are adopted. A comparison of the two methods is shown in figure 2. As shown in figure 2, over their original ranges, the two methods have different trends of change. If the variable x is a function of a logarithm of a probability or a logarithm of a DF, the linear normalization method is generally adopted; otherwise, the logarithmic normalization method is adopted. In addition, the normalization method can be selected according to experience in data analysis.
[00063] Linear normalization maps one interval onto another interval using a linear function. The formula is expressed as NormLinear(x) = kx + b, where k > 0 and x is MI(w), ECE(w) or SELPRE(w). Logarithmic normalization maps one interval onto another using a logarithmic function. The formula is expressed as NormLog(x) = log(kx + b), where k > 0 and x is χ²(w) or IG(w). The two methods above are ascending, that is, k > 0. If k < 0, the method is descending; the formulas adopted are NormLinearDesc(x) = kx + b and NormLogDesc(x) = log(kx + b), where x is ENT(w). The values of k and b can be calculated from the endpoints of the range after mapping.
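The linear variants can be sketched as follows, with k and b solved from the source and target interval endpoints as the text describes (an illustrative sketch, not part of the patent text):

```python
def make_linear_norm(src, dst, descending=False):
    # build NormLinear (ascending, k > 0) or NormLinearDesc (k < 0):
    # k and b are solved from the endpoints of the source interval src
    # and the target interval dst
    (x0, x1), (y0, y1) = src, dst
    if descending:
        y0, y1 = y1, y0
    k = (y1 - y0) / (x1 - x0)
    b = y0 - k * x0
    return lambda x: k * x + b

norm = make_linear_norm((0.0, 10.0), (0.0, 1.0))
desc = make_linear_norm((0.0, 10.0), (0.0, 1.0), descending=True)
```

NormLog would be built analogously by solving log(k·x0 + b) = y0 and log(k·x1 + b) = y1 for k and b.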
[00064] After the integrated evaluation methods are obtained, the multi-aspect evaluation method of a word can be determined according to the level of the word. Herein, corresponding multi-aspect evaluation methods are configured respectively for the four levels.
[00065] For words at the SuperHigh level and the MediumHigh level, all of the integrated evaluations above are reliable. Therefore, the multi-aspect evaluation can combine all of them, expressed as

SuperHigh(w) = MediumHigh(w) = Diff(w) + ProbBased(w) + ProbDFRel(w) + SelPre(w).
[00066] For words at the MediumLow level, the DF is not high and there are few words that can be used together with them, so the SelPre integrated evaluation method is less reliable. Therefore, the multi-aspect evaluation of words at the MediumLow level is expressed as

MediumLow(w) = Diff(w) + ProbBased(w) + ProbDFRel(w).
[00067] For words at the SuperLow level, the IG method and the chi-square method are both less reliable, and there are very few words that can be used together with them, so the SelPre method is not considered. Consequently, the multi-aspect evaluation of words at the SuperLow level is expressed as

SuperLow(w) = Diff(w) + ProbBased(w).
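The per-level combinations can be sketched as follows (an illustrative sketch; the combination operator is assumed to be a plain sum, since the patent's displayed formulas are not reproduced in this text):

```python
def multi_aspect_weight(level, diff_w, prob_based, prob_df_rel, sel_pre):
    # WgtPart(w): combine only the components considered reliable at the
    # word's DF level
    if level in ("SuperHigh", "MediumHigh"):
        return diff_w + prob_based + prob_df_rel + sel_pre
    if level == "MediumLow":          # too few collocates: drop SelPre
        return diff_w + prob_based + prob_df_rel
    if level == "SuperLow":           # chi-square and IG also unreliable
        return diff_w + prob_based
    raise ValueError("unknown level: %s" % level)
```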
[00068] After the multi-aspect evaluation method of the word is determined according to the level of the word, the defects of the single-aspect evaluations mentioned above in step 102 are overcome. Hereinafter, the multi-aspect evaluation of words at the high-frequency levels (including the SuperHigh level and the MediumHigh level) and at the SuperLow level is described, where differentiation is most difficult.
[00069] At the high-frequency levels, the two words "illuminate" and "ha ha" are considered. Although the IDFs of the two words are close, the word "illuminate" appears more in the "QQ games" category, while "ha ha" appears uniformly in all categories. Therefore, the two words can be differentiated using Diff(w). Additionally, the χ² of "illuminate" is 1201744, while the χ² of "ha ha" is 3412; after χ²(w) is normalized, the difference between them is even greater, and the situation is basically the same with respect to IG. Therefore, the importance of the two words can also be clearly differentiated through ProbDFRel(w). At the same time, ProbBased(w) is mainly used to determine the uniformity of the word's distribution over all categories, and it too can differentiate the two words. As for SelPre(w), "ha ha" is a generalized word and can be used together with many words, while "illuminate" is usually used in icons and contexts related to the QQ product. Therefore, the multi-aspect evaluation of "illuminate" is 9.65, while that of "ha ha" is 1.27. It can thus be determined that "illuminate" is a high-quality word and "ha ha" is a low-quality word.
[00070] At the SuperLow level, the word "Chujiangzhen" (a city in Hunan province) and a randomly entered word "fdgfdg" are considered. Both have a very low DF, and their IDFs are both approximately 14. But the word "Chujiangzhen" appears most often in the "region" category, while "fdgfdg" can appear in any category. Therefore, Diff(Chujiangzhen) = 2.12 and Diff(fdgfdg) = 1.05. Although the χ² of "Chujiangzhen" and the χ² of "fdgfdg" are both small, the two words can be differentiated by taking Diff(w) into account. At the same time, the ProbBased(w) of "Chujiangzhen" is obviously larger than that of "fdgfdg". Finally, the multi-aspect evaluation of "Chujiangzhen" is 9.71 and that of "fdgfdg" is 1.13. It can thus be determined that "Chujiangzhen" is a high-quality word and "fdgfdg" is a low-quality word.
[00071] In view of the above, combining multi-aspect evaluation with DF-based level classification makes it possible to determine the importance of a word according to the integrated evaluation method of the corresponding level. The values SuperHigh(w), MediumHigh(w), MediumLow(w) and SuperLow(w) obtained at each level are the importance weights of the words at the corresponding levels, and can be expressed generally as WgtPart(w).
[00072] In step 104, the quality of the word is determined according to the importance weight of the word.
[00073] After the importance weight of the word is obtained, the quality of the word can be determined according to that weight, so as to obtain high-quality words for subsequent use in document processing.
[00074] A processing method is as follows:
[00075] First, respectively set an importance threshold α and a frequently-used threshold β for each level. These two thresholds can be configured according to the extraction and evaluation requirements. If many important words are required, α can be configured smaller; otherwise, α can be configured larger. If many words are required to be classified into the frequently-used range, β can be set larger; otherwise, β can be configured smaller. Since four levels are configured in step 103, a pair of α and β is configured for each level; as a result, there are four pairs of α and β in total.
[00076] Next, the quality of the word at each level is determined according to the relationship between the two thresholds of the level and the importance weight of the word at the level. The quality of a word at each level can be expressed as: the word is important if WgtPart(w) ≥ α; the word is common if β ≤ WgtPart(w) < α; and the word is frequently used if WgtPart(w) < β.
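The threshold test in this paragraph amounts to the following (an illustrative sketch, not part of the patent text; α > β is assumed):

```python
def classify_quality(wgt_part, alpha, beta):
    # compare the word's per-level importance weight WgtPart(w) with the
    # importance threshold alpha and the frequently-used threshold beta
    if wgt_part >= alpha:
        return "important"
    if wgt_part >= beta:
        return "common"
    return "frequently used"
```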
[00077] After the above processing, what is obtained is merely the quality of the word at its own level. However, when a document is analyzed by selecting important words and common words, a uniform standard is usually required to differentiate the functions of different words.
[00078] After the candidate words are classified into levels according to the DF, the words at each level are effectively ranked according to their importance. But the extreme values of WgtPart(w) at different levels are different. Therefore, normalization processing is required, that is, the WgtPart(w) of each level is normalized to obtain an integrated importance weight Wgt(w) of the word. For example, an integrated importance weight Wgt(w) = NormLinear(WgtPart(w)) can be obtained through linear normalization. Alternatively, logarithmic normalization can also be adopted to obtain the word's integrated importance weight.
[00079] Finally, based on the Wgt(w) obtained by the normalization processing, an integrated quality classification is performed for words of the same quality at different levels. For example, if four levels are obtained in step 103, an integrated quality classification is carried out for the words of the four levels whose quality is important. A very-important threshold ε1 and a common-importance threshold ε2 are set for the levels after the normalization processing, and all words are classified by quality, expressed as
Similarly,


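The integrated classification can be sketched as follows, assuming the two global thresholds act directly on the normalized weight Wgt(w); the rank labels and the numeric values are illustrative assumptions, not values from the patent:

```python
def integrated_rank(norm_weight, eps_important, eps_common):
    """Rank a word across levels from its normalized integrated weight Wgt(w)."""
    if norm_weight >= eps_important:
        return "very_important"
    if norm_weight >= eps_common:
        return "important"
    return "common"

# Words judged "important" at their own levels, after normalization:
wgt = {"quantum": 0.95, "apple": 0.6, "nice": 0.2}
ranking = {w: integrated_rank(v, 0.9, 0.5) for w, v in wgt.items()}
```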
[00080] Another processing method is as follows.
[00081] Since the extreme values of WgtPart(w) at different levels are different, the WgtPart(w) of words at different levels are not comparable. Therefore, normalization processing is likewise required, that is, WgtPart(w) of each level needs to be normalized to obtain an integrated weight of the word. For example, an integrated weight Wgt(w) = NormLinear(WgtPart(w)) can be obtained through linear normalization. Alternatively, logarithmic normalization can also be adopted to obtain the integrated weight of the word.
[00082] Then, set an importance threshold α′ and a commonly-used threshold β′ after the normalization processing. According to a relationship between these two thresholds and the integrated importance weight of the word, the word is classified, expressed as

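The second processing method (normalize every level first, then classify all words once with a single pair of thresholds α′ and β′) can be sketched as follows; the min-max normalization and all names and values are illustrative assumptions:

```python
def classify_all(levels, alpha_p, beta_p):
    """Normalize WgtPart(w) inside each level, then classify every word
    with one global pair of thresholds alpha' / beta'."""
    out = {}
    for level_words in levels.values():
        lo, hi = min(level_words.values()), max(level_words.values())
        span = (hi - lo) or 1.0  # avoid division by zero for a flat level
        for w, v in level_words.items():
            wgt = (v - lo) / span          # integrated weight Wgt(w)
            if wgt >= alpha_p:
                out[w] = "important"
            elif wgt <= beta_p:
                out[w] = "commonly_used"
            else:
                out[w] = "ordinary"
    return out

# Illustrative raw per-level weights:
levels = {"SuperHigh": {"the": 0.1, "data": 2.1},
          "SuperLow": {"quark": 5.0, "zzz": 0.2}}
result = classify_all(levels, 0.8, 0.2)
```

In contrast to the first method, no per-level quality decision is made; the thresholds are applied once to the already-comparable integrated weights.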
[00083] The steps above can be performed by any device capable of word quality extraction and assessment, for example, a computer or a web server, which is not restricted in the embodiments of the present invention.
[00084] Figure 3 is a schematic diagram illustrating a structure of an apparatus for extracting and evaluating word quality according to an embodiment of the present invention. As shown in figure 3, the apparatus includes: a DF calculating unit, a singular-aspect evaluating unit and a multiple-aspect evaluating unit. The DF calculating unit is adapted to calculate the DF of a word in the categorized data mass; the singular-aspect evaluating unit is adapted to evaluate the word in multiple singular aspects according to the DF of the word; the multiple-aspect evaluating unit is adapted to evaluate the word in multiple aspects according to the multiple singular-aspect evaluations of the word to obtain an importance weight of the word.
[00085] The apparatus may further include a pre-processing unit, adapted to pre-process the words of the categorized data mass, for example, unification of traditional and simplified characters, unification of upper-case and lower-case letters, and unification of half-width and full-width characters, so as to standardize the words and make them uniform.
[00086] The apparatus may further include a quality determining unit, adapted to determine the quality of the word according to the importance weight of the word.
[00087] The DF calculating unit includes: a DF vector calculating module and a DF calculating module. The DF vector calculating module is adapted to calculate a DF vector of the word in each category of the categorized data mass; the DF calculating module is adapted to obtain a sum of the DF vectors of the word as the DF of the word in all categorized data.
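A minimal sketch of the DF vector and its sum, assuming each document is represented simply as a set of its words (the data-model choice and all names are illustrative):

```python
def df_vector(word, categorized_docs):
    """DF of `word` in each category: the number of documents containing it."""
    return {cat: sum(word in doc for doc in docs)
            for cat, docs in categorized_docs.items()}

def total_df(word, categorized_docs):
    """DF of the word in all categorized data: the sum of its DF vector."""
    return sum(df_vector(word, categorized_docs).values())

# Illustrative categorized data mass, one set of words per document:
docs = {
    "sports": [{"game", "ball"}, {"ball", "team"}],
    "tech":   [{"data", "ball"}],
}
```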
[00088] The singular-aspect evaluating unit includes multiple modules, each of which is used to implement one singular-aspect evaluation. The singular-aspect evaluating unit may include: an IDF module, an AVAIDF module, a chi-square module, an IG module, an MI module, an ECE module, an ENT module and a SELPRE module. Specifically, the IG module may include an interval division module and an IG calculating module. The interval division module is adapted to classify all candidate words into different intervals according to their DFs. The IG calculating module is adapted to calculate the IG of the word based on the categorized data that correspond to the interval of the word. When the interval division module classifies the candidate words, methods such as logarithmic gradient, linear gradient, exponential gradient, logarithmic plus linear gradient, or exponential plus linear gradient can be adopted.
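For example, interval division by a logarithmic gradient of DF might look like the following sketch; the base, the function name and the sample DFs are illustrative assumptions, not values from the patent:

```python
import math

def log_intervals(dfs, base=10):
    """Assign each candidate word to a DF interval on a logarithmic gradient:
    interval k holds words with base**k <= DF < base**(k+1)."""
    return {w: int(math.log(df, base)) for w, df in dfs.items() if df > 0}

# Illustrative candidate-word DFs:
dfs = {"the": 12000, "ball": 350, "quark": 7}
intervals = log_intervals(dfs)
```

A linear or exponential gradient would only change the mapping from DF to interval index; the IG of a word is then computed over the documents belonging to its interval.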
[00089] The multiple-aspect evaluating unit includes: a level division module and a multiple-aspect evaluating module. The level division module is adapted to classify candidate words into different levels according to the DFs of the words. The multiple-aspect evaluating module is adapted to determine the way of evaluating the multiple aspects of the word according to the level of the word, so as to obtain the importance weight of the word at the corresponding level. The level division module may include: a level interval division module and a word classifying module. The level interval division module is adapted to configure level intervals according to the DFs of the words in all categorized data. The word classifying module is adapted to classify the word into a corresponding level according to the DF of the word in all categorized data.
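A sketch of the level division by DF into the four levels named in the claims; the cut points 10/100/1000 and the sample DFs are illustrative assumptions, not values given by the patent:

```python
def assign_level(df, cuts=(10, 100, 1000)):
    """Classify a word into one of four levels from its DF in all
    categorized data. Cut points are illustrative placeholders."""
    low, mid, high = cuts
    if df >= high:
        return "SuperHigh"
    if df >= mid:
        return "MediumHigh"
    if df >= low:
        return "MediumLow"
    return "SuperLow"

levels = {w: assign_level(df) for w, df in
          {"the": 50000, "ball": 400, "quark": 30, "zzz": 2}.items()}
```

The multiple-aspect evaluating module would then pick the level-specific combination of singular-aspect scores (e.g. the combination claimed for SuperHigh/MediumHigh/MediumLow versus the one claimed for SuperLow) based on this label.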
[00090] The quality determining unit may include: a threshold setting module, a level quality determining module, a normalization processing module and an integrated classification module, as shown in figure 4A. The threshold setting module is adapted to set an importance threshold and a commonly-used threshold for each level, in which the levels are configured according to the DFs of the words in all categorized data. The level quality determining module is adapted to determine the quality of the word at the level according to a relationship between the two thresholds and the importance weight of the word at the level. The normalization processing module is adapted to normalize the importance weight of the word at each level to obtain an integrated importance weight of the word. The integrated classification module is adapted to classify words of the same quality at different levels according to the integrated importance weights of the words.
[00091] Alternatively, the quality determining unit may include: a normalization processing module, a threshold setting module and an integrated classification module, as shown in figure 4B. The normalization processing module is adapted to normalize the importance weight of the word at each level to obtain an integrated importance weight of the word, in which the levels are configured according to the DFs of the words in all categorized data. The threshold setting module is adapted to set an importance threshold and a commonly-used threshold. The integrated classification module is adapted to classify the word according to a relationship between the two thresholds and the integrated importance weight of the word.
[00092] What has been described and illustrated herein is a preferred example of the disclosure together with some of its variations. The terms, descriptions and figures used herein are presented for illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the disclosure, which is intended to be defined by the following claims and their equivalents, in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Claims:
Claims (14)
[0001]
1. Word extraction and assessment method, which comprises: calculating a Document Frequency (DF) of a word in a categorized data mass (101); evaluating the word in multiple singular aspects according to the DF of the word (102); and classifying candidate words into levels according to the DFs of the candidate words, in which the levels comprise a SuperHigh level, a MediumHigh level, a MediumLow level and a SuperLow level; and characterized by the fact that it further comprises: for each candidate word at the SuperHigh level, the MediumHigh level or the MediumLow level, determining the importance weight of the candidate word according to: an absolute value of a difference between the Average Inverse Document Frequency (AVAIDF) and the Inverse Document Frequency (IDF) of the candidate word, a linear combination of the Mutual Information (MI), Expected Cross Entropy (ECE) and Entropy (ENT) of the candidate word, a combination of the chi-square with logarithmic normalization and the Information Gain (IG) of the candidate word, and the Selective Preference (SELPRE) with logarithmic normalization of the candidate word; and for each candidate word at the SuperLow level, determining the importance weight of the candidate word according to: an absolute value of a difference between the Average Inverse Document Frequency (AVAIDF) and the Inverse Document Frequency (IDF) of the candidate word, a linear combination of the Mutual Information (MI), Expected Cross Entropy (ECE) and Entropy (ENT) of the candidate word, and a combination of the chi-square with logarithmic normalization and the Information Gain (IG) of the candidate word.
[0002]
2. Method, according to claim 1, characterized by the fact that calculating the DF of the word in the categorized data mass comprises: calculating a DF vector of the word in each category of the categorized data mass; and obtaining a sum of the DF vectors of the word over all categories as the DF of the word in all categories.
[0003]
3. Method, according to claim 1, characterized by the fact that the singular-aspect evaluation comprises one or more of: Inverse Document Frequency (IDF), Average IDF (AVAIDF), chi-square, Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), Entropy (ENT) and Selective Preference (SELPRE).
[0004]
4. Method, according to claim 3, characterized by the fact that, when the singular-aspect evaluation is the IG, evaluating the word according to the DF of the word comprises: classifying all candidate words into intervals according to the DFs of the candidate words; and calculating the IG of the word based on the categorized data that correspond to the interval of the word.
[0005]
5. Method, according to claim 1, characterized by the fact that classifying the candidate words into levels according to the DFs of the candidate words comprises: determining the levels according to the DF of each word in all categorized data; and classifying each word into a corresponding level according to the DF of the word in all categorized data.
[0006]
6. Method, according to any one of claims 1 to 5, characterized by the fact that it further comprises: before calculating the DF of the word in the categorized data mass, pre-processing the word in the categorized data mass; and/or, after determining the importance weight of the word, determining a quality of the word according to the importance weight of the word.
[0007]
7. Method, according to claim 6, characterized by the fact that determining the quality of the word according to the importance weight of the word comprises: setting an importance threshold and a commonly-used threshold for each level, in which the levels are obtained according to the DFs of the words in all categorized data; determining a quality of the word at the corresponding level according to a relationship between the two thresholds and the importance weight of the word at the level; normalizing the importance weight of the word at each level to obtain an integrated importance weight of the word; and, based on the integrated importance weight of the word, carrying out an integrated quality classification for words of the same quality at different levels; or, normalizing the importance weight of the word at each level to obtain an integrated importance weight of the word, in which the levels are obtained according to the DFs of the words in all categorized data; setting an importance threshold and a commonly-used threshold; and performing an integrated quality classification of the word according to a relationship between the two thresholds and the integrated importance weight.
[0008]
8. Apparatus for word extraction and evaluation, characterized by the fact that it comprises: a DF calculating unit, adapted to calculate a DF of a word in the categorized data mass; a singular-aspect evaluating unit, adapted to evaluate the word in multiple singular aspects according to the DF of the word; and a multiple-aspect evaluating unit, adapted to evaluate the word in multiple aspects according to the multiple singular-aspect evaluations to obtain an importance weight of the word; in which the multiple-aspect evaluating unit comprises: a level division module, adapted to configure levels according to the DFs of candidate words, in which the levels comprise a SuperHigh level, a MediumHigh level, a MediumLow level and a SuperLow level; and a multiple-aspect evaluating module, adapted to, for each candidate word at the SuperHigh level, the MediumHigh level or the MediumLow level, determine the importance weight of the candidate word according to: an absolute value of a difference between the Average Inverse Document Frequency (AVAIDF) and the Inverse Document Frequency (IDF) of the candidate word, a linear combination of the Mutual Information (MI), Expected Cross Entropy (ECE) and Entropy (ENT) of the candidate word, a combination of the chi-square with logarithmic normalization and the Information Gain (IG) of the candidate word, and the Selective Preference (SELPRE) with logarithmic normalization of the candidate word; and, for each candidate word at the SuperLow level, determine the importance weight of the candidate word according to: an absolute value of a difference between the Average Inverse Document Frequency (AVAIDF) and the Inverse Document Frequency (IDF) of the candidate word, a linear combination of the Mutual Information (MI), Expected Cross Entropy (ECE) and Entropy (ENT) of the candidate word, and a combination of the chi-square with logarithmic normalization and the Information Gain (IG) of the candidate word.
[0009]
9. Apparatus, according to claim 8, characterized by the fact that the DF calculating unit comprises: a DF vector calculating module, adapted to calculate a DF vector of the word in each category of the categorized data; and a DF calculating module, adapted to obtain a sum of the DF vectors of the word as the DF of the word in all categories.
[0010]
10. Apparatus, according to claim 8, characterized by the fact that the singular-aspect evaluating unit comprises: an Inverse Document Frequency (IDF) module, an Average IDF (AVAIDF) module, a chi-square module, an Information Gain (IG) module, a Mutual Information (MI) module, an Expected Cross Entropy (ECE) module, an Entropy (ENT) module and a Selective Preference (SELPRE) module.
[0011]
11. Apparatus, according to claim 10, characterized by the fact that the IG module comprises: an interval division module, adapted to configure intervals according to the DFs of all candidate words; and an IG calculating module, adapted to calculate an IG of the word according to the categorized data that correspond to the interval of the word.
[0012]
12. Apparatus, according to claim 8, characterized by the fact that the level division module comprises: a level interval division module, adapted to configure levels according to the DFs of the words in all categorized data; and a word classifying module, adapted to classify the word into a corresponding level according to the DF of the word in all categorized data.
[0013]
13. Apparatus, according to any one of claims 8 to 12, characterized by the fact that it further comprises: a pre-processing unit, adapted to pre-process the word in the categorized data mass; and/or a quality determining unit, adapted to determine the quality of the word according to the importance weight of the word.
[0014]
14. Apparatus, according to claim 13, characterized by the fact that the quality determining unit comprises: a threshold setting module, adapted to set an importance threshold and a commonly-used threshold for each level, in which the levels are obtained according to the DFs of the words in all categorized data; a level quality determining module, adapted to determine the quality of the word at the level according to a relationship between the two thresholds and the importance weight of the word at the corresponding level; a normalization processing module, adapted to normalize the importance weight of the word at each level to obtain an integrated importance weight of the word; and an integrated classification module, adapted to perform an integrated quality classification for words of the same quality at different levels based on the integrated importance weight of the word; or, the quality determining unit comprises: a normalization processing module, adapted to normalize the importance weight of the word at each level to obtain an integrated importance weight of the word, in which the levels are divided according to the DFs of the words in all categorized data; a threshold setting module, adapted to set an importance threshold and a commonly-used threshold; and an integrated classification module, adapted to perform an integrated quality classification for all words based on a relationship between the two thresholds and the importance weight of the word.
Similar technologies:
Publication No. | Publication date | Patent title
BR112012011091B1|2020-10-13|method and apparatus for extracting and evaluating word quality
Burger et al.2011|Discriminating gender on Twitter
JP4920023B2|2012-04-18|Inter-object competition index calculation method and system
CN106598999B|2020-02-04|Method and device for calculating text theme attribution degree
WO2018086470A1|2018-05-17|Keyword extraction method and device, and server
US9864795B1|2018-01-09|Identifying entity attributes
CN108920456A|2018-11-30|A kind of keyword Automatic method
CN108197117B|2020-05-26|Chinese text keyword extraction method based on document theme structure and semantics
CN106649250A|2017-05-10|Method and device for identifying emotional new words
JP2009110508A|2009-05-21|Method and system for calculating competitiveness metric between objects
WO2021139262A1|2021-07-15|Document mesh term aggregation method and apparatus, computer device, and readable storage medium
Rohini et al.2016|Domain based sentiment analysis in regional Language-Kannada using machine learning algorithm
Parthasarathy et al.2014|Sentiment analyzer: Analysis of journal citations from citation databases
Budhiraja et al.2020|A supervised learning approach for heading detection
Hofmann et al.2020|Predicting the growth of morphological families from social and linguistic factors
CN110705612A|2020-01-17|Sentence similarity calculation method, storage medium and system with mixed multi-features
US10229194B2|2019-03-12|Providing known distribution patterns associated with specific measures and metrics
Wojtinnek et al.2011|Semantic relatedness from automatically generated semantic networks
US20210073237A1|2021-03-11|System and method for automatic difficulty level estimation
Blekanov et al.2020|The ideal topic: interdependence of topic interpretability and other quality features in topic modelling for short texts
JP6026036B1|2016-11-16|DATA ANALYSIS SYSTEM, ITS CONTROL METHOD, PROGRAM, AND RECORDING MEDIUM
CN107688562A|2018-02-13|Word detection method, device, system
Jain et al.2017|Text analytics framework using apache spark and combination of lexical and machine learning techniques
Pérez et al.2019|Exploiting user-frequency information for mining regionalisms from social media texts
CN108021595B|2020-07-14|Method and device for checking knowledge base triples
Family patents:
Publication No. | Publication date
RU2517368C2|2014-05-27|
CN102054006B|2015-01-14|
WO2011057497A1|2011-05-19|
BR112012011091A2|2016-07-05|
CN102054006A|2011-05-11|
RU2012123216A|2013-12-20|
US8645418B2|2014-02-04|
US20120221602A1|2012-08-30|
Cited documents:
Publication No. | Filing date | Publication date | Applicant | Patent title

US6473753B1|1998-10-09|2002-10-29|Microsoft Corporation|Method and system for calculating term-document importance|
US7024408B2|2002-07-03|2006-04-04|Word Data Corp.|Text-classification code, system and method|
JP4233836B2|2002-10-16|2009-03-04|インターナショナル・ビジネス・マシーンズ・コーポレーション|Automatic document classification system, unnecessary word determination method, automatic document classification method, and program|
CN1438592A|2003-03-21|2003-08-27|清华大学|Text automatic classification method|
RU2254610C2|2003-09-04|2005-06-20|Государственное научное учреждение научно-исследовательский институт "СПЕЦВУЗАВТОМАТИКА"|Method for automated classification of documents|
US20090119281A1|2007-11-03|2009-05-07|Andrew Chien-Chung Wang|Granular knowledge based search engine|
US8577884B2|2008-05-13|2013-11-05|The Boeing Company|Automated analysis and summarization of comments in survey response data|
CN100583101C|2008-06-12|2010-01-20|昆明理工大学|Text categorization feature selection and weight computation method based on field knowledge|CN103186612B|2011-12-30|2016-04-27|中国移动通信集团公司|A kind of method of classified vocabulary, system and implementation method|
CN103885976B|2012-12-21|2017-08-04|腾讯科技(深圳)有限公司|The method and index server of configuration recommendation information in webpage|
CN103309984B|2013-06-17|2016-12-28|腾讯科技(深圳)有限公司|The method and apparatus that data process|
US9959364B2|2014-05-22|2018-05-01|Oath Inc.|Content recommendations|
CN105183784B|2015-08-14|2020-04-28|天津大学|Content-based spam webpage detection method and detection device thereof|
CN105975518B|2016-04-28|2019-01-29|吴国华|Expectation cross entropy feature selecting Text Classification System and method based on comentropy|
CN107463548B|2016-06-02|2021-04-27|阿里巴巴集团控股有限公司|Phrase mining method and device|
CN108073568B|2016-11-10|2020-09-11|腾讯科技(深圳)有限公司|Keyword extraction method and device|
CN107066441A|2016-12-09|2017-08-18|北京锐安科技有限公司|A kind of method and device for calculating part of speech correlation|
CN107169523B|2017-05-27|2020-07-21|鹏元征信有限公司|Method for automatically determining industry category of mechanism, storage device and terminal|
CN107562938B|2017-09-21|2020-09-08|重庆工商大学|Court intelligent judging method|
CN108269125B|2018-01-15|2020-08-21|口碑信息技术有限公司|Comment information quality evaluation method and system and comment information processing method and system|
CN108664470A|2018-05-04|2018-10-16|武汉斗鱼网络科技有限公司|Measure, readable storage medium storing program for executing and the electronic equipment of video title information amount|
CN109062912A|2018-08-08|2018-12-21|科大讯飞股份有限公司|A kind of translation quality evaluation method and device|
CN109255028B|2018-08-28|2021-08-13|西安交通大学|Teaching quality comprehensive evaluation method based on teaching evaluation data credibility|
CN110377709B|2019-06-03|2021-10-08|广东幽澜机器人科技有限公司|Method and device for reducing complexity of robot customer service operation and maintenance|
CN111090997B|2019-12-20|2021-07-20|中南大学|Geological document feature lexical item ordering method and device based on hierarchical lexical items|
CN111079426B|2019-12-20|2021-06-15|中南大学|Method and device for obtaining field document lexical item hierarchical weight|
CN112561500B|2021-02-25|2021-05-25|深圳平安智汇企业信息管理有限公司|Salary data generation method, device, equipment and medium based on user data|
Legal status:
2019-01-15| B06F| Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]|
2019-07-30| B06U| Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]|
2020-05-26| B09A| Decision: intention to grant [chapter 9.1 patent gazette]|
2020-05-26| B15K| Others concerning applications: alteration of classification|Free format text: THE PREVIOUS CLASSIFICATION WAS: G06F 17/30 Ipc: G06F 16/35 (2019.01), G06F 16/31 (2019.01) |
2020-10-13| B16A| Patent or certificate of addition of invention granted|Free format text: TERM OF VALIDITY: 10 (TEN) YEARS COUNTED FROM 13/10/2020, SUBJECT TO THE LEGAL CONDITIONS. |
Priority:
Application No. | Filing date | Patent title
CN200910237185.7A|CN102054006B|2009-11-10|2009-11-10|Vocabulary quality excavating evaluation method and device|
CN200910237185.7|2009-11-10|
PCT/CN2010/074597|WO2011057497A1|2009-11-10|2010-06-28|Method and device for mining and evaluating vocabulary quality|